Quiz: Split Sentences

In this exercise, you will read in some text from a file, split the text into sentences, and then split each sentence into words (tokens). Instead of using the built-in Python string method .split(), try using the regular expression package re.

Come up with an appropriate regular expression that matches sentence delimiters, and use it like this:

sentences = re.split(r"<your regexp>", text)

Note the 'r' preceding the regexp string: it denotes a raw string, in which Python does not treat backslashes as escape characters (e.g. '\n' stays a backslash followed by 'n' instead of being converted to a newline). This lets regular expression escapes such as '\s' reach the re package unchanged.
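As a small illustration of the raw-string point, the sketch below compares a plain and a raw literal and shows how the difference affects a pattern. The sample string and the pattern are made up for illustration and are not part of the quiz.

```python
import re

# In a normal string literal, "\n" is a single newline character;
# in a raw string, r"\n" is two characters: a backslash and an 'n'.
plain = "\n"
raw = r"\n"
print(len(plain), len(raw))  # 1 2

# Raw strings matter most for regex escapes such as \b (word boundary).
# Without the r prefix, "\b" is a backspace character, so nothing matches.
with_raw = re.findall(r"\bcat\b", "cat concat cat.")
without_raw = re.findall("\bcat\b", "cat concat cat.")
print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```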

Word delimiters can be specified the same way, as a regular expression passed to re.split(). Refer to the re library documentation for details.
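One possible way to express word delimiters, shown here only as a sketch, is the \W character class, which matches any character that is not a letter, digit, or underscore. The sample sentence is an assumption chosen for illustration; other patterns may suit your text better.

```python
import re

sample = "Look at it at least twice, and definitely watch part 2"

# \W+ matches one or more non-word characters, so the comma and the
# space after "twice" are consumed together as a single delimiter.
tokens = re.split(r"\W+", sample)
print(tokens)
# ['Look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2']
```

Note that \W+ also splits on apostrophes and hyphens, which may or may not be what you want for a given text.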

Remember to remove leading and trailing spaces. If that results in any empty strings, drop them from the list that is returned.
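The cleanup step can be sketched with two list comprehensions. This example deliberately uses a comma as the delimiter and a made-up input string, so it shows the technique without prescribing the sentence pattern.

```python
import re

parts = re.split(r",", " alpha , beta ,  , gamma ,")
# parts is [' alpha ', ' beta ', '  ', ' gamma ', '']

# Strip leading and trailing whitespace from every piece.
cleaned = [p.strip() for p in parts]

# Keep only non-empty strings; an empty string is falsy in Python.
non_empty = [p for p in cleaned if p]
print(non_empty)  # ['alpha', 'beta', 'gamma']
```

Building a new list this way avoids modifying a list while looping over it, which can cause items to be skipped.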

Start Quiz:

tokenizer.py

"""Splitting text data into tokens."""

import re

def sent_tokenize(text):
    """Split text into sentences."""
    
    # TODO: Split text by sentence delimiters (remove delimiters)
    
    # TODO: Remove leading and trailing spaces from each sentence
    
    pass  # TODO: Return a list of sentences (remove blank strings)


def word_tokenize(sent):
    """Split a sentence into words."""
    
    # TODO: Split sent by word delimiters (remove delimiters)
    
    # TODO: Remove leading and trailing spaces from each word
    
    pass  # TODO: Return a list of words (remove blank strings)


def test_run():
    """Called on Test Run."""

    text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"
    print("--- Sample text ---", text, sep="\n")
    
    sentences = sent_tokenize(text)
    print("\n--- Sentences ---")
    print(sentences)
    
    print("\n--- Words ---")
    for sent in sentences:
        print(sent)
        print(word_tokenize(sent))
        print()  # blank line for readability

User's Answer:

(Note: The user's answer is not guaranteed to be correct.)

tokenizer.py

"""Splitting text data into tokens."""

import re

def sent_tokenize(text):
    """Split text into sentences."""
    
    # Split text by sentence delimiters (remove delimiters)
    sentences = re.split(r'\s*[!?.]\s*', text)
    
    # Remove leading and trailing spaces from each sentence
    sentences = [sent.strip() for sent in sentences]
    
    # Return a list of sentences (remove blank strings). A new list is built
    # because calling list.remove() while iterating can skip items.
    return [sent for sent in sentences if sent]


def word_tokenize(sent):
    """Split a sentence into words."""
    
    # Split sent by word delimiters (remove delimiters); \W+ matches runs of
    # non-word characters, so whitespace and punctuation are both consumed
    words = re.split(r'\W+', sent)
    
    # Return a list of words (remove blank strings)
    return [word for word in words if word]


def test_run():
    """Called on Test Run."""

    text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"
    print("--- Sample text ---", text, sep="\n")
    
    sentences = sent_tokenize(text)
    print("\n--- Sentences ---")
    print(sentences)
    
    print("\n--- Words ---")
    for sent in sentences:
        print(sent)
        print(word_tokenize(sent))
        print()  # blank line for readability

INSTRUCTOR NOTE:

The nltk package is not available within programming quizzes, but you can use it going forward for labs and projects.
